Overview

Dataset Statistics

Number of Variables 19
Number of Rows 1.1024e+07
Missing Cells 1.7063e+07
Missing Cells (%) 8.1%
Duplicate Rows 124
Duplicate Rows (%) 0.0%
Total Size in Memory 12.2 GB
Average Row Size in Memory 1.2 KB
Variable Types
  • Categorical: 13
  • Geography: 1
  • Numerical: 5
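
The missing-cell percentage can be cross-checked against the other figures in this block. The sketch below assumes the percentage is defined as missing cells divided by rows times variables; the report does not state this definition, so treat it as an assumption. The inputs are the rounded values printed above.

```python
# Values copied from the Dataset Statistics block (rounded as reported).
n_variables = 19
n_rows = 1.1024e7
missing_cells = 1.7063e7

# Assumed definition: share of all cells (rows x variables) that are missing.
total_cells = n_rows * n_variables
missing_pct = 100 * missing_cells / total_cells

print(f"{missing_pct:.1f}%")
```

Under that assumption the result agrees with the reported 8.1%.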

Dataset Insights

Missing
  • director has 3960615 (35.93%) missing values
  • cast has 2550859 (23.14%) missing values
  • country has 5510080 (49.98%) missing values
  • date_added has 4577546 (41.52%) missing values
  • duration_int has 231117 (2.1%) missing values
  • duration_type has 231117 (2.1%) missing values
Skewed
  • release_year is skewed
  • duration_int is skewed
  • calificacion is skewed
High Cardinality
  • show_id has a high cardinality: 9668 distinct values
  • title has a high cardinality: 22042 distinct values
  • director has a high cardinality: 10095 distinct values
  • cast has a high cardinality: 16744 distinct values
  • country has a high cardinality: 886 distinct values
  • date_added has a high cardinality: 2003 distinct values
  • listed_in has a high cardinality: 1687 distinct values
  • description has a high cardinality: 22669 distinct values
  • id has a high cardinality: 22998 distinct values
  • clasificacion has a high cardinality: 105 distinct values
  • datetime has a high cardinality: 7751 distinct values
Constant Length
  • date_added has constant length 10
  • datetime has constant length 10
  • califPromedio has constant length 3
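
The per-column missing percentages in the insights can be reproduced from the raw counts. This is a minimal sketch: the exact row count 11024289 is an assumption, taken from the show_id letter count later in the report (one letter per non-missing row) rather than from the rounded 1.1024e+07 in the overview. The description count comes from its variable table, since that column is not flagged in the insights.

```python
# Assumption: exact row count, inferred from the show_id letter count.
N_ROWS = 11_024_289

# Missing counts copied from the report.
missing = {
    "director": 3_960_615,
    "cast": 2_550_859,
    "country": 5_510_080,
    "date_added": 4_577_546,
    "duration_int": 231_117,
    "duration_type": 231_117,
    "description": 1_815,  # from the variable table, not flagged as an insight
}


def missing_share(count: int, n_rows: int = N_ROWS) -> float:
    """Return the percentage of rows in which a column is missing."""
    return 100 * count / n_rows


for column, count in missing.items():
    print(f"{column}: {missing_share(count):.2f}%")

# Total number of missing cells across all columns.
total_missing = sum(missing.values())
print(total_missing)
```

The total comes to 17063149, which rounds to the 1.7063e+07 missing cells reported in the overview.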

Variables


show_id

categorical

Approximate Distinct Count 9668
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Memory Size 769577112
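
The "Approximate Unique (%)" line appears to be the distinct count divided by the number of rows. The sketch below assumes that definition and an exact row count of 11024289 (inferred from other figures in the report, not stated directly).

```python
# Assumption: exact row count inferred from the report's letter counts.
N_ROWS = 11_024_289


def unique_pct(distinct: int, n_rows: int = N_ROWS) -> float:
    """Return distinct values as a percentage of all rows."""
    return 100 * distinct / n_rows


print(f"show_id:   {unique_pct(9_668):.1f}%")
print(f"timestamp: {unique_pct(8_848_097):.1f}%")
```

Both results match the percentages printed in the corresponding variable tables (0.1% and 80.3%).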

Length

Mean 4.8074
Standard Deviation 0.4429
Median 5
Minimum 2
Maximum 5

Sample

1st row s1
2nd row s1
3rd row s1
4th row s1
5th row s1

Letter

Count 11024289
Lowercase Letter 11024289
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 41974038
  • show_id contains many words: 9668 words
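
The row labels in the Letter table (Lowercase Letter, Space Separator, Uppercase Letter, Dash Punctuation, Decimal Number) match the names of the Unicode general categories Ll, Zs, Lu, Pd and Nd. Assuming the report counts characters by those categories, a comparable tally can be sketched with the standard library. The sample values here are hypothetical.

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count characters across all strings by Unicode general category."""
    counts = Counter()
    for value in values:
        for ch in value:
            counts[unicodedata.category(ch)] += 1
    return counts


# Hypothetical identifiers shaped like the show_id samples.
sample = ["s1", "s1", "s10"]
counts = category_counts(sample)

print("Lowercase Letter:", counts["Ll"])
print("Decimal Number:", counts["Nd"])
```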

type

categorical

Approximate Distinct Count 2
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 777954306
  • The largest value (movie) is over 2.53 times larger than the second largest value (tv show)

Length

Mean 5.5673
Standard Deviation 0.9015
Median 7
Minimum 5
Maximum 7

Sample

1st row movie
2nd row movie
3rd row movie
4th row movie
5th row movie

Letter

Count 58248483
Lowercase Letter 58248483
Space Separator 3127038
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (movie, tv show) take over 50.0%
  • The most frequent word (movie) occurs over 2.53 times as often as the second most frequent word (show)

title

categorical

Approximate Distinct Count 22042
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Memory Size 933605037

Length

Mean 18.6761
Standard Deviation 11.605
Median 25
Minimum 1
Maximum 109

Sample

1st row the grand seductio...
2nd row the grand seductio...
3rd row the grand seductio...
4th row the grand seductio...
5th row the grand seductio...

Letter

Count 174678329
Lowercase Letter 174678329
Space Separator 24870275
Uppercase Letter 0
Dash Punctuation 526689
Decimal Number 1589869
  • title contains many words: 16635 words

director

categorical

Approximate Distinct Count 10095
Approximate Unique (%) 0.1%
Missing 3960615
Missing (%) 35.9%
Memory Size 577373838
  • The largest value (mark knight) is over 1.85 times larger than the second largest value (cannis holder)

Length

Mean 15.0419
Standard Deviation 11.8347
Median 15
Minimum 1
Maximum 1024

Sample

1st row don mckellar
2nd row don mckellar
3rd row don mckellar
4th row don mckellar
5th row don mckellar

Letter

Count 94903789
Lowercase Letter 94903789
Space Separator 9479908
Uppercase Letter 0
Dash Punctuation 172757
Decimal Number 14355
  • director contains many words: 11552 words

cast

categorical

Approximate Distinct Count 16744
Approximate Unique (%) 0.2%
Missing 2550859
Missing (%) 23.1%
Memory Size 1527755513
  • The largest value (maggie binkley) is over 1.64 times larger than the second largest value (1)

Length

Mean 95.7389
Standard Deviation 75.9464
Median 131
Minimum 1
Maximum 1099

Sample

1st row brendan gleeson, t...
2nd row brendan gleeson, t...
3rd row brendan gleeson, t...
4th row brendan gleeson, t...
5th row brendan gleeson, t...

Letter

Count 654070975
Lowercase Letter 654070975
Space Separator 104289870
Uppercase Letter 0
Dash Punctuation 1668492
Decimal Number 99835
  • cast contains many words: 46654 words

country

categorical

Approximate Distinct Count 886
Approximate Unique (%) 0.0%
Missing 5510080
Missing (%) 50.0%
Memory Size 430150815
  • The largest value (united states) is over 4.15 times larger than the second largest value (india)

Length

Mean 13.0077
Standard Deviation 8.5042
Median 13
Minimum 4
Maximum 137

Sample

1st row canada
2nd row canada
3rd row canada
4th row canada
5th row canada

Letter

Count 65100049
Lowercase Letter 65100049
Space Separator 5256000
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0

date_added

categorical

Approximate Distinct Count 2003
Approximate Unique (%) 0.0%
Missing 4577546
Missing (%) 41.5%
Memory Size 483505725
  • The largest value (2019-11-12) is over 4.96 times larger than the second largest value (2020-01-01)

Length

Mean 10
Standard Deviation 0
Median 10
Minimum 10
Maximum 10

Sample

1st row 2021-03-30
2nd row 2021-03-30
3rd row 2021-03-30
4th row 2021-03-30
5th row 2021-03-30

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 12893486
Decimal Number 51573944
  • date_added contains many words: 2003 words
  • The most frequent word (20191112) occurs over 4.96 times as often as the second most frequent word (20200101)
  • date_added has words of constant length
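
A constant length of 10 with two dashes and eight digits per value is consistent with ISO 8601 dates (YYYY-MM-DD), as the samples suggest. The sketch below assumes every value follows that format and shows how such strings could be parsed into real dates instead of being kept as categorical text.

```python
from datetime import date

# Values that appear in the date_added section of the report.
samples = ["2021-03-30", "2019-11-12", "2020-01-01"]

# fromisoformat raises ValueError on anything that is not YYYY-MM-DD.
parsed = [date.fromisoformat(s) for s in samples]
lengths = {len(s) for s in samples}

print(parsed)
print(lengths)
```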

release_year

numerical

Approximate Distinct Count 101
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 176388624
Mean 2010.819
Minimum 1920
Maximum 2021
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • release_year is skewed left (γ1 = -2.9206)

Quantile Statistics

Minimum 1920
5-th Percentile 1981
Q1 2010
Median 2016
Q3 2019
95-th Percentile 2021
Maximum 2021
Range 101
IQR 9
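
The quartiles above are enough to work out where outliers would start under the common 1.5 x IQR rule (Tukey fences). Whether the report uses exactly this rule for its outlier count is an assumption; the sketch only shows the arithmetic.

```python
def tukey_fences(q1: float, q3: float, k: float = 1.5):
    """Return the (lower, upper) bounds beyond which values count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Q1 and Q3 for release_year, copied from the quantile table.
low, high = tukey_fences(2010, 2019)
print(low, high)  # 1996.5 2032.5
```

Since the maximum is 2021, every outlier under this rule would lie below the lower fence, that is, titles released before 1997.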

Descriptive Statistics

Mean 2010.819
Standard Deviation 15.3866
Variance 236.7483
Sum 2.2168e+10
Skewness -2.9206
Kurtosis 9.2768
Coefficient of Variation 0.007652
  • release_year is not normally distributed (p-value 1.3388268834249549e-17)
  • release_year has 1178507 outliers
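
Several of the descriptive statistics are derived from the mean and standard deviation, so they can be sanity-checked. This is a sketch under two assumptions: the row count is exactly 11024289 (inferred from other figures in the report), and the coefficient of variation is the standard deviation divided by the mean.

```python
# Values copied from the release_year tables.
mean = 2010.819
std = 15.3866
n_rows = 11_024_289  # assumption: exact row count

variance = std ** 2   # should be close to the reported 236.7483
cv = std / mean       # coefficient of variation
total = mean * n_rows # should be close to the reported sum

print(f"variance = {variance:.2f}")
print(f"cv       = {cv:.6f}")
print(f"sum      = {total:.4e}")
```

The small difference in the variance comes from squaring a standard deviation that was already rounded to four decimals.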

listed_in

categorical

Approximate Distinct Count 1687
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 983128555
  • The largest value (drama) is over 1.75 times larger than the second largest value (comedy)

Length

Mean 24.1784
Standard Deviation 14.8227
Median 38
Minimum 4
Maximum 79

Sample

1st row comedy, drama
2nd row comedy, drama
3rd row comedy, drama
4th row comedy, drama
5th row comedy, drama

Letter

Count 227977950
Lowercase Letter 227977950
Space Separator 24045979
Uppercase Letter 0
Dash Punctuation 648266
Decimal Number 0

description

categorical

Approximate Distinct Count 22669
Approximate Unique (%) 0.2%
Missing 1815
Missing (%) 0.0%
Memory Size 3587201031

Length

Mean 200.9349
Standard Deviation 120.2033
Median 235
Minimum 1
Maximum 1870

Sample

1st row a small fishing vi...
2nd row a small fishing vi...
3rd row a small fishing vi...
4th row a small fishing vi...
5th row a small fishing vi...

Letter

Count 1791322616
Lowercase Letter 1791322616
Space Separator 361205023
Uppercase Letter 0
Dash Punctuation 5855593
Decimal Number 5956398
  • description contains many words: 43555 words

id

categorical

Approximate Distinct Count 22998
Approximate Unique (%) 0.2%
Missing 0
Missing (%) 0.0%
Memory Size 780601401

Length

Mean 5.8074
Standard Deviation 0.4429
Median 6
Minimum 3
Maximum 6

Sample

1st row as1
2nd row as1
3rd row as1
4th row as1
5th row as1

Letter

Count 22048578
Lowercase Letter 22048578
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 41974038
  • id contains many words: 22998 words

duration_int

numerical

Approximate Distinct Count 225
Approximate Unique (%) 0.0%
Missing 231117
Missing (%) 2.1%
Infinite 0
Infinite (%) 0.0%
Memory Size 172690752
Mean 67.1062
Minimum 0
Maximum 601
Zeros 4684
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • duration_int is skewed right (γ1 = 0.6863)

Quantile Statistics

Minimum 0
5-th Percentile 1
Q1 4
Median 85
Q3 102
95-th Percentile 139
Maximum 601
Range 601
IQR 98

Descriptive Statistics

Mean 67.1062
Standard Deviation 51.4006
Variance 2642.0239
Sum 7.2429e+08
Skewness 0.6863
Kurtosis 5.9201
Coefficient of Variation 0.766
  • duration_int is not normally distributed (p-value 6.047979110483523e-16)
  • duration_int has 14756 outliers

duration_type

categorical

Approximate Distinct Count 3
Approximate Unique (%) 0.0%
Missing 231117
Missing (%) 2.1%
Memory Size 744438064
  • The largest value (min) is over 3.82 times larger than the second largest value (season)

Length

Mean 3.9731
Standard Deviation 1.5453
Median 6
Minimum 3
Maximum 7

Sample

1st row min
2nd row min
3rd row min
4th row min
5th row min

Letter

Count 42881884
Lowercase Letter 42881884
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 0
  • The top 2 categories (min, season) take over 50.0%
  • The most frequent word (min) occurs over 3.82 times as often as the second most frequent word (season)

clasificacion

categorical

Approximate Distinct Count 105
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 756681648

Length

Mean 3.6377
Standard Deviation 1.5272
Median 5
Minimum 1
Maximum 10

Sample

1st row g
2nd row g
3rd row g
4th row g
5th row g

Letter

Count 22377406
Lowercase Letter 22377406
Space Separator 142112
Uppercase Letter 0
Dash Punctuation 5518807
Decimal Number 9523231

userId

numerical

Approximate Distinct Count 115077
Approximate Unique (%) 1.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 176388624
Mean 89972.509
Minimum 1
Maximum 270896
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • userId is skewed right (γ1 = 1.1708)

Quantile Statistics

Minimum 1
5-th Percentile 5756.2
Q1 29023.75
Median 57184
Q3 117268
95-th Percentile 265326.6
Maximum 270896
Range 270895
IQR 88244.25

Descriptive Statistics

Mean 89972.509
Standard Deviation 86866.0139
Variance 7.5457e+09
Sum 9.9188e+11
Skewness 1.1708
Kurtosis -0.0921
Coefficient of Variation 0.9655
  • userId is not normally distributed (p-value 0.0)
  • userId has 1982627 outliers
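
A negative kurtosis, as reported here, is only possible under the excess-kurtosis convention (normal distribution = 0). The sketch below computes skewness (γ1) and excess kurtosis from population moments; whether the report applies a bias correction is not stated, so the exact formula is an assumption. The input data is hypothetical.

```python
from statistics import fmean


def skew_and_excess_kurtosis(xs):
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(xs)
    mu = fmean(xs)
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis


# Hypothetical samples: one symmetric, one with a long right tail.
sym_skew, sym_kurt = skew_and_excess_kurtosis([1, 2, 3, 4, 5])
right_skew, _ = skew_and_excess_kurtosis([1, 1, 1, 2, 10])

print(sym_skew, sym_kurt)
print(right_skew)
```

A positive skewness indicates a longer right tail and a negative one a longer left tail, which is how the "skewed right" and "skewed left" labels in this report read.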

timestamp

numerical

Approximate Distinct Count 8848097
Approximate Unique (%) 80.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 176388624
Mean 1.1725e+09
Minimum 789652004
Maximum 1501826675
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • timestamp is skewed right (γ1 = 0.0661)

Quantile Statistics

Minimum 789652004
5-th Percentile 8.4651e+08
Q1 9.9264e+08
Median 1.1564e+09
Q3 1.3654e+09
95-th Percentile 1.4842e+09
Maximum 1501826675
Range 712174671
IQR 3.7277e+08

Descriptive Statistics

Mean 1.1725e+09
Standard Deviation 2.0568e+08
Variance 4.2303e+16
Sum 1.2926e+16
Skewness 0.06605
Kurtosis -1.226
Coefficient of Variation 0.1754
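
The magnitude of these values suggests Unix epoch seconds, although the report treats the column as a plain number. Assuming that interpretation, the minimum and maximum can be converted to calendar dates to see the period the data covers.

```python
from datetime import datetime, timezone


def to_utc(ts: int) -> datetime:
    """Interpret an integer as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Minimum and maximum copied from the timestamp table.
first = to_utc(789_652_004)
last = to_utc(1_501_826_675)

print(first.date().isoformat())
print(last.date().isoformat())
```

Under this assumption the data spans January 1995 to August 2017, which is in line with the sample dates shown for the datetime column.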

datetime

categorical

Approximate Distinct Count 7751
Approximate Unique (%) 0.1%
Missing 0
Missing (%) 0.0%
Memory Size 826821675

Length

Mean 10
Standard Deviation 0
Median 10
Minimum 10
Maximum 10

Sample

1st row 2003-07-30
2nd row 1996-08-13
3rd row 2001-01-03
4th row 2012-06-25
5th row 2000-03-30

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 22048578
Decimal Number 88194312
  • datetime contains many words: 7751 words
  • datetime has words of constant length

calificacion

numerical

Approximate Distinct Count 10
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 176388624
Mean 3.5335
Minimum 0.5
Maximum 5
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • calificacion is skewed left (γ1 = -0.6901)

Quantile Statistics

Minimum 0.5
5-th Percentile 1.5
Q1 3
Median 4
Q3 4
95-th Percentile 5
Maximum 5
Range 4.5
IQR 1

Descriptive Statistics

Mean 3.5335
Standard Deviation 1.0597
Variance 1.1229
Sum 3.8954e+07
Skewness -0.6901
Kurtosis 0.2014
Coefficient of Variation 0.2999
  • calificacion is not normally distributed (p-value 3.7116983653424804e-13)
  • calificacion has 514764 outliers

califPromedio

categorical

Approximate Distinct Count 5
Approximate Unique (%) 0.0%
Missing 0
Missing (%) 0.0%
Memory Size 749651652
  • The largest value (3.5) is over 1.64 times larger than the second largest value (3.6)

Length

Mean 3
Standard Deviation 0
Median 3
Minimum 3
Maximum 3

Sample

1st row 3.5
2nd row 3.5
3rd row 3.5
4th row 3.5
5th row 3.5

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 22048578
  • The top 2 categories (3.5, 3.6) take over 50.0%
  • The most frequent word (35) occurs over 1.64 times as often as the second most frequent word (36)
  • califPromedio has words of constant length

Interactions

Correlations

Missing Values